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Heterogeneous Tensor Decomposition for 
Clustering via Manifold Optimization 

Yanfeng Sun, Junbin Gao, Xia Hong, Bamdev Mishra and Baocai Yin 


Abstract —Tensors or multiarray data are generalizations of matrices. Tensor clustering has become a very important research topic 
due to the intrinsically rich structures in real-world multiarray datasets. Subspace clustering based on vectorizing multiarray data has 
been extensively researched. However, vectorization of tensorial data does not exploit complete structure information. In this paper, we 
propose a subspace clustering algorithm without adopting any vectorization process. Our approach is based on a novel heterogeneous 
Tucker decomposition model. In contrast to existing techniques, we propose a new clustering algorithm that alternates between different 
modes of the proposed heterogeneous tensor model. All but the last mode have closed-form updates. Updating the last mode reduces 
to optimizing over the so-called multinomial manifold, for which we investigate second order Riemannian geometry and propose a 
trust-region algorithm. Numerical experiments show that our proposed algorithm compete effectively with state-of-the-art clustering 
algorithms that are based on tensor factorization. 

Index Terms —Tensor Clustering, Tucker Decomposition, Heterogeneous Tensor Decomposition, Manifold Optimization, Multinomial 
Manifold 

- > - 


1 Introduction 

I N the last two decades, the advance of modem sens¬ 
ing, networking, communication and storage tech¬ 
nologies have paved the way for fhe availabilify of mulfi- 
dimensional dafa wifh high dimensionalify. For example, 
remofe sensing is producing massive mulfidimensional 
dafa fhaf need fo be carefully analyzed. One of fhe 
characferisfics of fhese giganfic dafasefs is fhaf fhey offen 
have a large amounf of redundancies. This mofivafes fhe 
developmenf of a low-dimensional represenfafion fhat 
besf assisfs a range of learning fasks in order fo avoid fhe 
so-called "curse of dimensionalify" Q. Many dafa pro¬ 
cessing fasks involve manipulafing mulfi-dimensional 
objecfs. For example, video dafa Q can be regarded as 
an objecf of pixel locafion in fwo dimensions plus one 
dimension in fime. Similar represenfafion can be seen 
in manipulafing remofe sensing dafa ||^. In analyzing 
personalized webpages, fhe dafa is usually represenfed 
as a fhird order dafasef wifh fhree dimensional modes 
of users, query words and webpages, respecfively @.In 
document clustering, one presents the dataset in a three- 
way format of aufhors, ferms and times ©,@. _ 

The multi-dimensional data are known as tensors Q, 
Q, where data elements are addressed by more than 
two indices. An A-order tensor is an element of fhe 
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fensor producf of N vecfor spaces. A 2D mafrix is an 
example of fhe 2nd-order fensor. Similarly, hyperspecfral 
imagery |[^ is nafurally a fhree-dimensional (3D) dafa 
cube confaining bofh spafial and specfral dimensions. 
Simulfaneously considering bofh specfral and spafial 
strucfures of hyperspecfral dafa in clusfering or clas- 
sificafion lead fo superior resulfs, e.g., in p7) . To fhis 
end, we prefer freafing, e.g., a 3D cube as a whole. In 
order fo circumvent the issue of high dimensionalify, 
a sfrafegy is fo compress fhe dafa while capfuring fhe 
dominanf frends or fo find fhe mosf suifable "sparse" 
represenfafion of fhe dafa. 

Dafa clusfering is one of fhe widely used dafa mining 
fechniques p0| . Many clusfering mefhods consider each 
dafa as a high dimensional mafhemafical vecfor A t 5 rpi- 
cal way is fo pre-process fhose high dimensional vecfors 
with dimensionality reduction techniques such as princi¬ 
pal component analysis (PCA) pT] . For dealing with ten¬ 
sorial data, a conventional step is to first vectorize them 
before any analysis is applied. For example, fhis pro¬ 
cedure is pracficed fo converf image dafa, a 2nd-order 
fensor, info vecfors for processing. Nof surprisingly, fhis 
sfrafegy breaks higher order dependencies fhaf may be 
presenf in fhe dafa. Recenfly new approaches fhaf are 
capable of direcfly processing sfrucfural informafion of 
fensorial dafa have been proposed, e.g., fensorial dafa 
sfrucfure is exploifed in compufer vision applicafions 
1 13 1, PH and machine learning pH / p6) - 

Tensorial dafa have a large amounf of redundan¬ 
cies. If is desired fo have a mechanism fo reduce such 
redundancies for fhe sake of efficienf applicafion of 
learning algorifhms. We use fhe general ferm dimen¬ 
sionalify reducfion fo describe such fechniques or mod¬ 
els. There exisf various fensor decomposifion models, 
amongsf which fhe CANDECOMP (canonical decom- 
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position)/PARAFAC (parallel factors) or in short CP 
decomposition {Tt) and the Tucker decomposition are two 
fundamental models for fensor decomposifion (refer fo 
fhe survey paper |j^ for defails). If should be nofed fhaf 
fhe CP decomposifion is a special case of fhe Tucker 
decomposifion Q, where fhe facfor mafrices have same 
number of columns and fhe core fensor is superdiagonal, 
which means fhaf every mode of fhe fensor is of fhe 
same size and ifs elemenfs remain consfant under any 
permufafion of fhe indices. Many ofher decomposifion 
algorifhms/ models can be viewed as special formafs 
of fhe CP and Tucker decomposifion. For example, fhe 
higher order SVD algorifhm (HOSVD) p8) , an exfension 
of fhe classical SVD, is a special case of fhe general 
Tucker decomposifion of a fensor in which fhe core 
fensor is of fhe same dimension as fhe fensor fo be 
decomposed and all fhe mode mafrices have orthonormal 
columns. Similarly, fhe classical PCA has several exfen- 
sions for fensorial dafa. The generalized fensor PCA 
(GND-PCA) seeks a shared Tucker decomposifion for all 
fhe given fensors in which the core tensors are different 
but the matrix factors along each mode are orthogonal. 
This decomposition procedure is also called the higher- 
order orthogonal iteration (HOOI) algorithm in 

In applications where data are non-negative such as 
images, the non-negative matrix factorization (NFM) has 
proven to be a successful approach for defecfing essenfial 
feafures in fhe dafa p0| . Several efficienf algorifhms 
have been proposed \21\ , p2| . Recenfly NFM has been 
exfended fo non-negafive fensor facforization (NTF) and 
has been invesfigafed in p^ , p3) , p4) , p^ , p6) . 

Anofher frend in fensor decomposifion research is fo 
introduce more structures in the decomposition model 
itself. Recently, Zhang et al. |27| considered Tri-ONTD 
(Tri-factor orthogonal non-negative tensor decomposi¬ 
tion), a new tensor decomposition model. The funda¬ 
mental aim of fhis model is fo discover common char- 
acferisfics of a series of mafrix dafa. A sfraighfforward 
applicafion of Tri-ONTD is fo idenfify clusfer sfrucfures 
of a dafasef. The core idea behind fhis model is based on 
fhe cenfroid-based clusfering algorifhms such as fhe k- 
means algorifhm. The idea of infroducing new sfrucfures 
in tensor decomposition can also be seen in | [28) in the 
context of image represenfation. 

Offen fensor clusfering fasks are formulafed as op- 
fimizafion problems on specific fensor decomposifion 
models p^, p0|, [311, p2). For example, when impos¬ 


sulfing opfimization problem is on fhe Stiefel manifold |33, 


ing some specific consfrainfs, like orfhogonalify in fhe 
HOOI algorifhm p^ , fhe resulfing problems reduce fo 
opfimization over matrix manifolds p3| . In fhe case of fhe 
HOOI algorifhm for fhe HOSVD decomposifion, fhe re- 


Secfion 3.3]). While fhe opfimizafion problem in HOSVD 
admifs a closed-form solution under fhe leasf squared 
error criferion, compufing a closed-form solufion is nof 
possible for fhe fensor clusfering problem fhaf we con¬ 
sider in fhis paper. Mosf existing algorifhms avoid fhis 
issue by reformulafing if info an opfimizafion problem 


over fhe "flaf" Euclidean space wifh some freatment, e.g., 
by infroducing a regularizafion ferm. 

Recenf years have wifnessed significant development 
of Riemannian opfimizafion algorifhms on mafrix man¬ 
ifolds such as fhe Sfiefel manifold, fhe Grassmann man¬ 
ifold, and fhe manifold of positive definife mafrices 
1331, p^ , p^ . The Riemarmian opfimizafion framework 
endows a mafrix manifold consfrainf wifh a Rieman¬ 
nian manifold structure. Goncepfually if franslafes a con- 
sfrained opfimizafion problem info an unconstrained op¬ 
timization problem on a Riemannian manifold. Since fhe 
Riemannian opfimizafion framework is direcfly based on 
nonlinear manifolds, one can eliminafe fhose consfrainfs 
such as orfhogonalify fo obfain an unconsfrained opfi¬ 
mizafion problem fhaf, by consfrucfion, will only use 
feasible poinfs. The recenf successful applications of fhe 
Riemannian opfimizafion framework in machine learn¬ 
ing, compufer vision and dafa mining, include low rank 
opfimizafion p4) , p^, p6|, esfimafion p7) , Riemannian 
dicfionary learning ]38[, |3^, and compufer vision fasks 
| |40} , fo name a few. 

In fhis paper, we propose a novel subspace clusfering 
algorifhm fhaf exploifs fhe fensorial sfrucfure of dafa. To 
fhis end, we infroduce a new heferogeneous Tucker de¬ 
composifion model. The proposed clusfering algorifhm 
alfemafes befween differenf modes of fhe proposed hef¬ 
erogeneous fensor model. All buf fhe lasf mode have 
closed-form updafes. Updafing fhe lasf mode reduces 
fo opfimizing over fhe multinomial manifold, defined in 
Secfion|^ for which we invesfigafe second order Rieman¬ 
nian geomefry and propose a frusf-region algorifhm. The 
mulfinomial manifold is given a Riemarmian manifold 
sfrucfure by endowing if wifh fhe Fisher information 
metric | [44) , | |46) . The Fisher information metric gives the 
multimonial manifold a differentiable sfrucfure, i.e., if 
ensures fhat fhe boundary is "scaled" fo infinify. 

The contribufion of fhis paper is twofold. Firsf, we 
propose a heferogeneous Tucker decomposifion model 
for fensor clusfering. Second, we invesfigate fhe Rieman¬ 
nian geomefry of fhe mulfinomial manifold and apply 
fhe Riemannian frusf-region algorifhm fo fhe resulfing 
nonlinear clusfering problem over fhe mulfinomial man¬ 
ifold. 

The paper is organized as follows. We infroduce fhe 
nofafions for fensor represenfation and operafions used 
in fhis paper in Secfion |2.1| Secfion |2.2| infroduces fhe 
clusfering scheme based on fhe proposed heferogeneous 
Tucker decomposifion model. The associafed opfimiza¬ 
fion problems are discussed in Secfion 2.3 Secfion 
explores fhe Riemannian geomefry for fhe mulfinomial 
manifold and develops all fhe necessary opfimizafion- 
relafed ingredienfs. Secfion |^presenfs fhe algorifhm pro¬ 
cedure for fhe fensor clusfering including fhe proposed 
Riemannian frusf-region algorifhm. Secfion shows nu¬ 
merical experimenfal resulfs on bofh synfhefic fensorial 
dafa and real-world dafasefs. Finally Secfion [^concludes 
fhe paper. 
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2 Heterogeneous Tucker decomposi¬ 
tion MODEL FOR CLUSTERING 

In this section, starting with notation for tensors, we 
motivate the work, and finally propose fhe new hefero- 
geneous Tucker decomposition model for clusfering. 


2.1 Tensor Notation and Operations 

In the sequel, we follow the convention used in Q to 
denote ID vector by lowercase boldface symbols like v, 
2D matrix by uppercase boldface symbols like U and 
general tensors by calligraphy symbols like X. 

Let X G ]^Tix - x/„x - x/n ]v-order tensor with 

the {ii - ■ - in' ■ • tw)th element. The n-mode 
product of an iV-order tensor X with a matrix U„ G 
]j/,.x j„ jg denoted by X x„ U„. The result is an A^-order 
tensor of dimension /i x • • • x In-i x JnX In+i x • • • x In- 
Element wise, the n-mode product is expressed as 







inin + l---iN'^jnin 


Given an 7V-order tensor T” G K^x•••x/„x -x/jv^ .^Yg 
seek a Tucker model, as defined below, to approximate 
the tensor X as 


X XiUi X 2 U 2 X 3 -- - XjvUat 

^[0;Ui,U2,...,U^l, (1) 

where Q is an iV-order tensor of dimension Ji x • • • x 
Jn X ■ ■ ■ X Jn with J„ < In, Called the core tensor, and 
U„ G ^ is the matrix applied along mode-n. In this 
decomposition, the core tensor Q is interpreted as a lower 
dimensional representation of the tensor X. The Tucker 
decomposition is a form of higher-order PCA | |4T| , where 
all the factor matrices U„ are shared by a group of given 
tensors. 


2.2 Heterogeneous Tucker Decomposition Modei 

Most Tucker decomposition models are of a homoge¬ 
neous nature, by which we mean that all the factor 
matrices U„ satisfy the same constraint. For example, 
in the classical HOSVD Q, all the factor matrices are 
required to be orthogonal, i.e., UjU„ = Ij^ (denoted by 
I for simplicity). In the nonnegative Tucker decompo¬ 
sition model [24) , | [25l , all the factors are matrices with 
nonnegative entries, i.e., U„ > 0. 

However, the requirement for homogeneous factors 
XJn is not preferred in many cases. Especially, when 
the factor matrices in different modes have different 
interpretations. For example, consider the problem of 
clustering a set of images (2-order tensors). We can 
stack all the images onto a 3-order tensor, where the 
third mode corresponds to the number of images. For 
clustering the images, it is of interest to decompose the 
entire 3-order tensor in a way that, along the third mode, 
each image is represented by several cluster representa¬ 
tives, as done in fuzzy k-means algorithms. This can be 


achieved by ensuring that the last mode factor matrix, 
i.e., U 3 is nonnegative and the row sum of U 3 is 1 to 
mimic the cluster probability of an image. 

In general, we suppose that we are given a set of M 
{N — l)-order tensors, denoted by {Xi,X 2 ,...,Xm} and 
we want cluster them into K clusters. This can be done 
by projecting the tensors along all the first N —1 modes, 
then cluster them. To this end, we stack all the M {N— 1)- 
order tensors along the N mode, so that we have an 
iV-order tensor X in a way that each slice of X, along 
the last mode, is one of (iV — l)-order tensors Xi {I = 
1,2,..., M). Following the general notation in Section [2T[ 
we have In = M. 

Our proposed model is defined by the optimization 
problem 

min /(^,U) (2) 

e,U(’Ui=I,...,UT_iUjv_i=I,Ujvl=l.U„>0 

with 

f{g,U) = i||A’ - 0 Xi Ui X 2 U 2 X 3 • • • xat Uwlll, 

where 1 is a column vector of all ones, Ui, ...,Ujv-i are 
matrices whose columns are orthogonal, || • Ijp- denotes 
the Frobenius norm, and Uat G is nonnegative and 

the sum of each row is 1. In ||^, the dimension of G in 
mode N is K. 

If each of the K slices of G is interpreted as cluster 
centroids in the projected space rAx Jax -x Jw-i^ then the 
rows of Uv has the interpretation of cluster indicators. 

Given this heterogeneous Tucker decomposition 
model that is specifically aimed for clustering, we have 
a new t 5 rpe of matrix manifold, that is the cartesian 
product of the Stiefel manifolds, corresponding to the 
constraints Uf^Ui = I, ...,U^_ 2 UAr_i = I and the the 
multinomial manifold that corresponds toUivl = l,Ujv > 
0. It should be noted that the problem |j^ needs two 
different treatments in terms of optimizing their factor 
matrices, i.e., for the first {N — 1) orthogonal matrices 
and the tensor clustering indicator matrix Ujv, respec¬ 
tively. In the following we formulate these optimization 
problems. 


2.3 Optimization 

To reduce the number of optimization variables in prob¬ 
lem let us optimize the core tensor G when all the 
factor matrices are fixed to their current values. This 
becomes a least-squares problem. It is straightforward 
to prove that the solution is given by ||^ 

G = x XiUf X2--- xn -1 Ul-i xn [{VIUn)-^VI], 

( 3 ) 


where we have used the orthogonality of U„ (n = 
1,2,..., A — 1). It should be noted that Uv is column 
full rank. 
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Using in the objective function of l|^ resulfs in fhe 
following relation 

/= i|lA’-^XiUi X2U2 X 3 -- - x^UvIll^ 

= l\\xrF-l{gxNivj,VM),g) 

= l\\^\\% -l\\Xx,V^ X2--- Xn-1 VJj_,Xn 
xn[Un{VIUn)-^V]:j]\\1. 

Subsequenfly, minimizing ||^ is equivalenf fo 

max fi(U), (4) 

Uf Ui=I,...,U^_jU„_i=I.U„l = l,Ujv>0 

where 

b(U) = i||A’XiUf X2 --- 

x^_l X^ [UN{VjfVN)-^VN]fF- 

If should be emphasized fhaf fhe function h is smooth 
in fhe variables and fhe consfrainfs are separable. This 
mofivafes fo consider an alternating optimization scheme 
for Q, where we maximize wifh respecf fo one 
variable while fixing ofhers. This procedure is cyclically 
repealed. 

Lef Va, = UAr(U^Uv)”^U^. If all buf U„ (n = 
1,2, 1) are fixed, fhen U„ is optimized by solving 

an eigenvector problem. Specifically, consider updafing U„ 
(n = 1,2,..., N — 1) while all fhe ofhers being fixed. 
Denofe 

U(_„) = Vjv (8) • ■ ■ 0 U(„+i) 0 (g) • • • (g) Ui. 

Subsequenfly, Q can be rewritten as, after mode-n ma- 
fricizafion, 

U-u!=7 „ ^l|Un(X(„)Uf_„))|||, (5) 


wifh X(Ar) as fhe iV-mode mafricizafion of X and (g is 
fhe Kronecker producf of mafrices. 

Problem 0 is a nonlinear opfimizafion problem over 
fhe consfrainf {UatIUtvI = l,Ujv > 0 }, called fhe 
multinomial manifold, where 1 is fhe vecfor of all ones. 
An efticienf numerical algorifhm is needed. To fhis end. 


we propose an a frusf-region algorifhm in Secfion 4.1 


based on fhe Riemannian sfrucfure of fhe mulfinomial 
manifold fhaf is discussed in Secfion 

If should be nofed fhaf fhere exisf several efticienf 
algorifhms for linearly consfrained smoofh convex opfi¬ 
mizafion, e.g., p^ . However, if is nof clear whefher such 
approaches are efticienfly implemenfable for a sfrucfured 
nonconvex objective funcfion such as fhe one in |[^. 


3 The MULTINOMIAL MANIFOLD 

For fhe concepfs of general absfracf manifolds, we refer 
readers fo fhe fexfbook [ |43l . Each row of Ujv is a discrefe 
probabilify disfribufion which describes fhe membership 
over fhe fensor cenfroids. All fhe discrefe probabilify 
disfribufions make up fhe mulfinomial manifold (also 
called a simplex) defined by 

= |u = {ui,..., ukY e : Ufc > 0 , Ufc = 1 
I k=l 

If should be nofed fhaf fhe nonnegafivify consfrainf 
Uk > 0 is replaced wifh sfricf posifivify fo ensure 
fhaf fhe sef P*^ is differenfiable. A possible use of fhe 
mulfinomial manifold is in proposing a classifier wifh 
kernels, e.g., in [ |44) , |j4^, |46|. For fhe purpose of solving 
fhe opfimizafion problem we invesfigafe fhe producf 
manifold of mulfiple mulfinomial manifolds, still called 
fhe mulfinomial manifold, defined as 

Pm = ju = [Umk\ G = 1 

I fc=i 


where X(^ is fhe n-mode mafricizafion of fensor X. The 
problem 0 has a closed-form solufion given by, leffing 

Bn = X(n)Uj-_^y 

U„ = uf(B^), ( 6 ) 

where uf( ) exfracfs fhe orfhogonal facfor of fhe po¬ 
lar decomposition of ifs argumenf and is compufed as 
uf(A) = PQ^, where A = PSQ^ is fhe fhin singular 
value decomposition of A. 

As discussed in Secfion [2.2} fhe nonnegafivify con¬ 
sfrainf on Uat is specifically infroduced for clusfering, 
leading fo fhe opfimizafion problem 

U TV J- — Z Z 

= Hr(B^Uw(U^U^)-iU^BA,), (7) 

where 

Bat = X(Ar)[U ^_3 g • ■ • = X(Ar)[UAr-l g ■ ■ • gUi] 


Despife fhe use of fhe Fisher information mefric on 
fhe mulfinomial manifold in | |44| , | |4^ , fo fhe besf of 
our knowledge, fhe derivafion of fhe Riemarmian gra- 
dienf and fhe Riemarmian Hessian of a smoofh objective 
funcfion is new. The compufafions are shown in fhe 
subsequenf sections. 

3.1 The Submanifold Structure 

Let us represent an element of P^ with the notation U 
which is the matrix representation of size M x K. It is 
straightforward to show that the tangent space at U s 
P^ is given by 

TuPm = {^u G : pul = 0}, 

where 1 G is a column vector of all ones and 0 G 
is a column vector of all zeros. 

It should also be noted that, we can characterize the 
manifold P^ as an embedded Riemannian submanifold of 
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the Euclidean space equipped with the metric g 

(inner product), i.e., 

( 8 ) 

, ^ mk 

m,k 

where and gu belong to the tangent space TuF^ at 
the point U on the manifold . The metric gu defining 
fhe new Riemannian sfrucfure of fhe manifold is fhe 
called Fisher information metric [ |46| . The irmer producf 
defined in ||^ defermines fhe geomefry such as disfance, 
angle, curvafure on P^. 

The notion of Riemannian immersion helps in com- 
pufing fhe Riemannian gradienf and Hessian formulas 
on fhe manifold P^ in ferms of fheir formulas in fhe 
embedding space The basis of fhis is fhe fol¬ 

lowing orfhogonal, in fhe sense of fhe proposed mefric, 
projection operafor. 

To projecf a mafrix Z s onfo fhe fangenf 

space TuP^, we define fhe linear operation IIu : 

. Zt^nu(Z) as 

nu(Z) = Z - (al^)©U, 


where a = zi e 


pM 


and © is fhe elemenf-wise mafrix 


mulfiplicafion operafion. This projection operation is 
compufed by characferizing fhe fangenf space and ifs 
complemenfary space in fhe sense of fhe mefric (j^. 

Anofher imporfanf concepf in fhe recenf refracfion- 
based framework of Riemarmian optimization is fhe 
concepf of a retraction operafion p3) Secfion 4.1]. The 
refracfion mapping is used fo locafe fhe nexf iferafe 


on fhe manifold along a specified fangenf vecfor |33 


Chapfer 4]. The exponential map Exp^ is fhe canonical 
choice for fhe refracfion mapping, which generalizes 
fhe nofion of "following a sfraighf line" in fhe Eu¬ 
clidean space. However, in fhis paper we work wifh 
fhe following sfandard approximation fhaf is easier fo 
characferize. Given a fangenf vecfor € TuF^, fhe 
proposed refracfing mapping is 


U+ =i?u(Cu) 

:=(U © exp(f(^u 0 U))) 0 (U © exp((^u 0 U))ll^), 


where 0 is fhe elemenf-wise mafrix inversion and exp( ) 
is fhe elemenf-wise exponenfial operafor applied fo ma- 
frices. 


3.2 The Riemannian Gradient Computation 

Let GradF(U) be the Euclidean gradient of a smooth 
function F : i —> M with the Euclidean metric. The 

gradient in endowed with the metric g is scaled 

as GradF(U) © U. This can be attributed to the "change 
of basis" in the Euclidean space 

The expression of the Riemannian gradient gradF(U) 
on P^ is obtained by projecting the scaled-gradient 
GradT"(U) © U to the tangent space TuF^j, i.e., 

gradJ^(U) = nu(GradJ^(U) © U). (9) 


3.3 The Riemannian Hessian Computation 

In order to compute the Riemannian Hessian, we need 
the notion of the Riemannian connection pS) Section 5.5]. 
The Riemarmian connection, denoted as general¬ 

izes the covariant-derivative of the tangent vector ryu along 
the direction of the tangent vector on the manifold 
P^. Since P^ is a Riemarmian submanifold of 
the Riemannian cormection on P^ is also characterized 
by the projection of the corresponding cormection V^^ryu 
in the embedding space endowed with the metric 

i.e., V^uTyu = npr(V^u7yu) ||^ Proposition 5.3.2]. 

The connection V in the Euclidean space en¬ 

dowed with the metric (|^ is computed using the Koszul 
formula Theorem 5.3.1], and after a few steps of 
computations, it admits the matrix characterization 

= H7yu[Cu] - ^(Cu © Vu) 0 U. 

The Riemarmian Hessian HessT"(U)[^u] is the 
covariant-derivative of the Riemarmian gradient 
gradf]!!) in the direction G TuP^/ i-e-/ 

HessF(U)[Cu] = nu(V5„gradF(U)) 

=nu(HgradF(U)[eu] - © gradF(U)) 0 U), (10) 

where gradF(U) is the Riemannian gradient and 
ZlgradT"(U)[^u] is the Euclidean directional derivative 
of the Riemarmian gradient in the direction G TuP^, 
which is 

i9gradF(U)Ku] = Hnu(GradF(U) © 
=i9GradF(U)[eu] 0 U -h GradJ^(U) © 
-(al^)©eu-(t9aKu]l^)©U, (11) 

where a = (GradF(U) © U)l, i9a[^u] = 

(£iGradF(U)[^u]U + GradF(U) © Cu)l, and 
i9GradF(U)[^u] is the Euclidean directional derivative 
of the Euclidean gradient GradF(U) along the 

direction G TuF^, i.e., i9GradT"(U)[^u] = 

lim(GradF(U -h t^u) - GradF{V))/t. 


4 The Algorithm 

The optimization problem 0 in Section |2.3| is nonlinear, 
with linear constraints. To this end, we propose to use 
the Riemannian trust-region (RTR) optimization algo¬ 
rithm 133 Ghapter 7]. 


4.1 The Riemannian Trust-Region Aigorithm 

In order to solve (|^ we define F{IJn) = 

The problem 0 

boils down to the optimization problem of the form 

min F(U). (12) 

UGPJS 

In order to simplify the exposition in the subsequent 
sections, we use U and B in instead of Uat and B^r, 
respectively. Similarly, we use F instead of —F. 
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TABLE 1 

Optimization-related ingredients for |T^ . 


Matrix representation of an element in 
the Multinomial Manifold 

A matrix U of size M x K. 

mK 

Pft ~ {u - > 0, = 1}. 

Tangent vectors in TuPm 

{^u ^ : ^ul = 0}, where 1 ^ R^ is a column vector of 

ones. 

Metric gu(^u,gu) for any ^u,gu £ 
TuP^ 

l,k 

Projection of Z e onto the 

tangent space TuPm with Hu 

IIu(Z) := Z— (al^)©U, where a = Zl^ and © is the element¬ 
wise matrix multiplication operation. 

Retraction iiu(Cu) that maps a search 
direction onto Pm 

(U © exp(i(^u 0 U))) 0 (U © exp(i(^u 0 U))ll^), where 0 is 
the element-wise matrix inversion and exp(-) is the element-wise 
exponential operator. 

Riemannian gradient gradT’(U) 

nu(GradT’(U) © U), where Gradi^jU) denotes the Euclidean 
gradient of function F. 

Riemannian Hessian HessF(U)[^u] 
along £ TuPm 

nu(HgradT(U)[^ul - ^(Cu © gradT(U)) 0 U), where 
T)gradF(U)[^ul is defined in (TT). 


The RTR algorithm is a generalization of the classical 
unconstrained trust-region (TR) method to Riemannian 
manifolds. If is a mafrix-free and globally convergenf 
second-order mefhod suifable for large-scale opfimiza- 
fion on Riemarmian manifolds. Each iterafion consisfs of 
fwo sfeps: ( 1 ) approximafing fhe solufion of fhe trust- 
region subproblem and ( 2 ) compufing a new iferafe based 
on fhe refracfing mapping, defined in Secfion 3.1 The 
frusf-region subproblem af U S is defined as 


min F(U)-L5u(gradJ^(U),^u) 

CueruPj^,||ju||<A 

+ ^gu(HessF(U)[eu],^u), (13) 

where g is fhe Riemannian mefric ([S), A is fhe frusf- 
region radius, and ||^u|| = \/ 5 u(Cu,l^u)- ® amounfs 
fo minimizing a quadratic model of fhe objecfive funcfion 
wifhin a frusf-region radius of A. Here, gradF(U) is 
fhe Riemannian gradienf of F and HessT"(U)[^u] is the 
Riemannian Hessian of F along ^u- 

The Riemannian gradienf and Hessian of an objecfive 
funcfion on fhe manifold can be compufed using fhe 
expressions in (To) and (TT) from fhe Euclidean gra¬ 
dienf GradF(U) and Euclidean Hessian IlGradF(U)[^u] 
counferparfs. 

Eor our objecfive funcfion 

J^(U) = -Hr(B'^U(U'^U)-iU'^B), 
if is sfraighfforward fo check, using mafrix calculus (47), 


fhat 

GradJ^(U) 

= - BB^U(U^U)-^ -L U(U^U)-^U^BB^U(U^U)-^ 
and 

HGradF(U)[eu] 

= - BB^^u(U^U)-i -L Cu(U^U)-i(U'^BB^U)(U'^U)-i 
2 BB^U(U^U)"^sym(CSU)(U'^U)"^ 

-h 2U(U^U)-isym(CSBB^U)(U^U)-i 
-2U(U^U)-isym(U^Cu)(U^U)-iU^BB^U(U^U)-i 
-2U(U^U)-iU'^BB^U(U^U)-isym(U^^u)(U^U)-\ 

where sym(A) = (A-|-A^)/2 exfracfs fhe symmefric parf 
of A. 

Once fhe expressions of fhe Riemannian gradienf and 
Hessian are obfained, we use fhe Riemarmian frusf- 
region (RTR) algorifhm implemenfed in fhe Manopf fool- 
box (48) . Wifh all fhe ingredienfs summarized in Table 
fhe main sfeps of fhe RTR algorifhm are presented in 
Fig- a for fhe paper fo be self-confained. 


4.2 Overall Algorithm 

The overall algorithm for (^ is based on an alfernafing 
opfimizafion scheme in which one mafrix variable is up- 
dafed while fixing fhe ofher mafrix variables. In Secfion 

2.3 fhe updafes of fhe mafrix variables Ui, U 2 ,..., U n-i 


are shown in closed form. The updafe of fhe lasf mode 
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Require: An initial guess Uq on the manifold K- 
Ensure: The minimum U for fhe objecfive funcfion F. 

1 : Confinue fhe following for loop unfil a convergence 
criferion is safisfied. 

2 : for i = 1,2,... do 

3: Approximafely minimize fhe frusf-region subprob¬ 

lem (T^ for a new direcfion ^u- 

4: Consfrucf fhe new frial iferafe by using refracfion 

mapping U+ = i?u(Cu)- 

5: Updafe fhe iferafe by rejecfing or accepfing U+ de¬ 

pending on ifs qualify of decrease in fhe objecfive 
funcfion. 

6 : Updafe fhe frusf-region radius A. 

7: end for 


Remark 3: The algorifhm in Fig. sfarfs fhe for loop 
wifh updafing fhe membership informafion U^r. We 
can also sfarf fhe for loop wifh updafing factor mafrix 
Ui, U 2 ,..., Utv-i. In fhis case, we can use k-means to 
make an esfimafe for Uv- 

Remark 4: The overall algorifhm is an alfernafing 
opfimizafion scheme for Q. The general convergence 
analysis of alfernafing opfimizafion schemes is discussed 
in Secfion 8.9]. The convergence analysis of RTR is 
discussed in p3} Chapfer 7]. If should be emphasized 
fhaf fhe objecfive funcfion in ( ^ is nonlinear and RTR, in 
general, converges fo a crifical poinf or a local minimum. 
Nonefheless, fhis suffices for fhe scheme in Fig. ^ for 
solving Q, which is a challenging opfimizafion problem. 


Fig. 1. The Riemannian trust-region aigorithm. 


Un is computed by solving |[^ with the Riemannian 


trust-region algorithm discussed in Section 4.1 The pro¬ 


cedure of cyclically updating the variables is repeated 
until a convergence criterion is satisfied. Finally, the 
membership information Ujv is used with a clustering 
algorithm, such as the k-means, to learn the final clus¬ 
tering parameters. 

The overall algorithm is summarized in Fig. 


Require: Tensorial data X = {Xi,..., Xm}, dimensions 
Ii,l 2 , ■. ■, In-1 and the number of clusters K. 
Ensure: Factor Matrices Ui,U 2 ,...,Uv-i and Clusters. 

1 : Initialize Ui, U 2 ,..., Ujv-i. 

2 : Continue the following for loop until a convergence 
criterion is satisfied. 

3: for i = 1,2,... do 

4: Call the Riemannian trust-region algorithm (RTR) 

in Fig. to compute U^v. 

5: for n = 1 , 2,..., iV — 1 do 

6 : Solve (|^ for U„ by using ||^. 

7: end for 

8 : end for 

9: Do k-means over U ^ for the final clustering. 

Fig. 2. The overall algorithm for Q. 

Remark 1: The tensor clustering model and algorithm. 
Adaptive Subspace Iteration on Tensor (AST-T), pro¬ 
posed in | [3T| is a special version of the HOSVD model. 
The idea is to combine tensor projection and k-means by 
minimizing the distance of projected tensors and the one 
of K tensor centroids. The objective function is different 
from our formulation. AST-T uses a heuristic method to 
approximately solve the relevant optimization problem. 

Remark 2: Our model is also different from the fuzzy k- 
means algorithm in which the fuzzy cluster membership 
parameters are applied over the distance between data 
and centroids. In our model, we regard each data as a 
linear combination of centroids under relevant member¬ 
ship coefficients. 


4.3 Initialization 

Since we have a nonlinear objective function to be 
optimized over the "curved" multinomial manifold, an 
initialization has an impact on the final result. In our 
experiments, shown in the next section, we consider the 
following strategies for initialization. 

Random Initialization: This suggests using randomly 
produced orthogonal basis as each of the factor matrices 
Ui, U2,..., Utv-i. 

HOSVD Initialization I: Given the data tensor X, we 
conduct the HOSVD or HOOT Keeping the orthogonal 
matrix factors Ui,U 2 ,...,Ujv_i fixed to the initial values, 
we update U^r by using the Riemannian trust-region 
algorithm. However, the orthogonal matrix factors gen¬ 
erated by the HOSVD algorithm may not accurately 
represent data for clustering. In other words, the ini¬ 
tialization for U^f from the HOSVD is far from a good 
membership representation of clusters. As a result, the 
the first call to the RTR algorithm requires a larger 
number of iterations, e.g., 1000 . 

HOSVD Initialization II: The HOSVD or HOOI decom¬ 
position is designed for a single tensor. As we have to 
single out the last mode for the purpose of clustering, 
we design a similar HOSVD algorithm for the first 
(N — 1) modes. In other words, the resulting problem 
is over a group of (N — l)-order tensors {Xi}^^ for 
which we learn their Tucker decompositions under a 
fixed set of factor matrices {Ui, ...jU^v-i} such that the 
optimization problem 

M 

\\Xi — t/i Xi Ui X 2 • • • xjv-i Uv-i||f 

i=l 

is minimized. This problem can be solved in a way 
similar to the problem of computing the single tensor 
HOSVD. However, this initialization may not provide 
any information relating to the tensor representation in 
terms of the K centroids along the A-th mode. 

5 Numerical Experiments 

In this section, we present a set of experimental re¬ 
sults on S 5 rnthetic and real-world datasets with high 
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dimensional spatial structures. The intention of these 
experiments is to demonstrate the performance of our 
proposed model in comparison fo a number of sfafe-of- 
fhe-arf clusfering mefhods. The algorifhm proposed in 
Fig. compefe effecfively wifh fhe benchmark mefhods 
in ferms of predicfion accuracy. 


5.1 Evaluation Metrics 

To quantitatively evaluate the clustering results, we 
adopt two evaluation metrics, the accuracy (AC) and the 
normalized mutual information (NMI) metrics pO) . Given a 
data point x^, let L and L be the ground truth label and 
the cluster label provided by the clustering approaches, 
respectively. The AC measure is defined by 


i—1 ^ ' 


where M is the total number of samples and the function 
(5(a, b) is set to 1 if and only if a = b, and 0 otherwise. 
The operator Map(-) is the best mapping function that 
permutes L to match L, which is usually implemented 
by the Kuhn-Munkres algorithm |j^. 

The other metric is the normalized mutual information 
measure between two index sets L and L, defined as 


NMI(L,L) =- MI(L,£) 

maxfH 

where H(£) and H(L) denote the entropy of L and L, 
respectively, and 



p (y) and p (x) denote the marginal probability distribu¬ 
tion functions of L and L, respectively, and p (x, y) is 
the joint probability distribution function of L and L. 
NMI(£,£) ranges from 0 to 1, for which the value 1 
means that the two sets of clusters are identical and the 
value 0 means that the two are independent. Different 
from AC, NMI is invariant to permutation of labels, i.e., 
it does not require the matching processing in advance. 


5.2 Dataset Description 

This section describes the real-world datasets that we 
use for assessing the performance of various clustering 
algorithms. 

CBCL face dataset: This face datasej^ contains images 
of size 19 X 19. The goal of clustering for this dataset is 
to cluster the images into two different classes: face and 
non-face. 

MNIST dataset: This handwritten digits datasej^ con¬ 
sists of 70,000 handwritten digit images, including a 

1. http://cbclmit.edu/projects/cbcl/software-datasets/FaceData2. 
html 

2. http://yann.lecun.com/exdb/mnist/ 


training set of 60,000 examples and a test set of 10,000 
examples. It is a subset extracted from a larger set avail¬ 
able from NIST. The digits have been size-normalized 
and centered in the fixed-size 28 x 28. The images are 
in grey scale and each image can be treated as a 784- 
dimension feature vector or a 28 x 28 second order tensor. 
This dataset has 10 classes corresponding to the digits 0 
to 9, with all the images being labeled. 

PIE dataset, CMU: The CMU Pose, Illumination, and 
Expression (PIE) datasej^ consists of 41,368 images of 
68 people. Each subject was imaged under 13 different 
poses, 43 different illumination conditions, and with 4 
different expressions. In this paper, we test the algo¬ 
rithms (ours and the benchmarks) on the Pose27 sub¬ 
dataset as described in | [52) . The image size is 32 x 32. 

ORE dataset: The AT&T ORE datase^f] consists of 10 
different images for each of 40 distinct subjects, thus 400 
images in total. All the images were taken against a dark 
homogeneous background with the subjects in an up¬ 
right, frontal position, under varying lighting, facial ex¬ 
pressions (open/closed eyes, smiling/not smiling), and 
facial details (glasses/no glasses). Eor our experiments, 
each image is resized to 32 x 32 pixels. 

Extended Yale B dataset: Eor this datasej^ we use the 
cropped images and resize them to 32 x 32 pixels. This 
dataset now has 38 individuals and around 64 near 
frontal images under different illuminations per individ¬ 
ual. 

Dynamic texture dataset: The D 5 mTex++ datasej^ con¬ 
sists of video sequences that contains d 5 mamic river 
water, fish swimming, smoke, cloud and so on. These 
videos are labeled with 36 classes and each class has 
100 subsequences (a total of 3600 subsequences) with 
a fixed size of 50 x 50 x 50 (50 gray frames). This is 
a challenging dataset for clustering because most of 
texture from different class is fairly similar. 


5.3 Assessing Initialization Strategies 

we propose three different initialization 


In Section 4.3 


schemes for the proposed tensor clustering algorithm. 
In this experiment, we assess the influence of those 
initialization strategies on the final clustering accuracy 
on the sub-dataset Pose27 from the PIE dataset. 

Before we present the experimental results, we de¬ 
scribe the parameter settings used in the experiments. 
We randomly select 1000 data from the sub-dataset 
Pose27. There are 24 different classes in this chosen sub¬ 
set. The image size is 32 x 32. We choose the size of core 
tensors to be 12 x 12. In the overall algorithm described 
in Eig. 1^ we fix the number of iterations to 250 as the 
recovered error does not vary much after 250 iterations 
in most testing cases. Eor the RTR algorithm in line 4 


3. http://www.ri.cmu.edu/research_project_detail.html7project_id 
=418&menu_id=261 

4. http://www.uk.research.att.com/facedatabase.html. 

5. http://vision.ucsd.edu/'leekc/ExtYaleDatabase/ExtYaleB.html 

6. http://www.bernardghanem.com/datasets 
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TABLE 2 

Results for random Initialization. 


Evaluation metric 

Mean 

Std Var 

Best 

Worst 

AC 

0.7120 

0.0394 

0.7630 

0.6150 

NMI 

0.8189 

0.0210 

0.8467 

0.7717 


of Fig. we use the default parameters proposed in the 
toolbox Manopt with the RTR maximal inner iteration 
number 30. For the first call to the RTR algorithm, we 
set the number of maximum outer iterations to 1000 
with Manopt's default initialization for Uat {N = 3 
in this case). For subsequent calls to RTR, we use 5 
iterations with the current Uat as the initial values for 
memberships. In each overall iteration, we repeat the 
calls to Lines 5-7 in Fig. twice to make sure that the 
factors U„ stable. 

We conduct 20 runs for randomly initializing Ui and 
U 2 . The statistics are reported in Table Under similar 
parameter settings, AC = 0.7620 and NMI = 0.8199 for 
HOSVD Initialization I and AC = 0.7240 and NMI = 
0.8132 for HOSVD Initialization II, respectively. Both 
the random initialization and HOSVD initialization II 
strategies give comparable results and are are shown in 
Fig. I^b). It should be noted that the reconstructed error 
does not improve with iterations. On the other hand. Fig. 
I^a) shows the error curve for the HOSVD Initialization 
I strategy. As HOSVD gives the best estimate in terms 
of the model error, the error goes up initially when 
we move away from the best orthogonal factor U 3 to 
a membership factor. Finally, it stabilizes as shown in 
Fig. I^a). This leads to the conclusion that among the 
three initialization strategies discussed in Section 4.3 
HOSVD initialization I is the initialization of choice. In 
subsequenf experiments, we initialize all the compared 
algorithms with HOSVD initialization I. 

5.4 Algorithms and Comparisons 

There exist many different clustering algorithms such as 
spectral clustering approaches | [53) a nd the recent popular 
matrix factorization approaches j21[, |[^. Since the Tucker 
decomposition is a multidimensional generalization of 
mafrix facforizafion, we focus on fhe comparison be- 
fween our proposed heferogeneous Tucker model wifh 
recent matrix factorization based clustering algorithms. 
The following mefhods are chosen as fhe benchmark 
for numerical comparisons. All fhe experiments are con¬ 
ducted on a desktop with an Intel Core 15-4670 CPU at 
3.40GHz and with RAM of 8.00GB. 

Multiplicative method for nonnegative matrix factorization 
(MM) i [2l| : This nonnegafive mafrix facforizafion algo¬ 
rithm is designed for vectorial dafa fhaf are organized 
info a dafa mafrix X, i.e., each column represenfs an 
image. The mafrix X is factorized info fhe producf of 
fwo nonnegafive mafrices such thaf X = WH. The 
columns of W are regarded as fhe basis vectors and 
H is interpreted as the weights, i.e., each data vector 




(b) HOSVD Initialization II 


Fig. 3. The errors against the number of iterations. 


is obtained by a weighted linear combination of the 
the basis vectors. The matrix H is also interpreted as 
containing clustering membership information. 

Alternating least-squares algorithm (ALS) i|5^; This al¬ 
gorithm is designed to solve the nonnegative matrix 
factorization problem of facforizing fhe nonnegative data 
matrix X as WH by alternating optimization over the fac¬ 
tor matrices W and H with the nonnegativity constraint. 

Bayesian nonnegative matrix factorization (B-NFM) 1^ : 
The model is based on the probabilistic framework in 
which the matrix factors W and H are regarded as 
random variables characterized by relevant statistical 
density functions. A Bayesian algorithm is proposed for 
solving fhe problem. 

Projected gradient method for nonnegative matrix factor¬ 
ization (PGM) i [22| f; This algorifhm resulfs from simul- 
faneously opfimizing over fhe nonnegative factors W 
and H to approximately decompose a nonnegative data 
matrix X as WH. Lin p2) proposes an efficient Newton 
algorithm. 

Dual regularized co-clustering method (DRCC) The 
DRCC algorithm is a data clustering algorithm that 
is based on semi-nonnegative matrix tri-factorization with 
dual regularization over data graph and feature graph by 
exploiting the geometric structure of fhe dafa manifold 
and fhe feafure manifold. 

Nonnegative tri-factor tensor decomposition (Tri-ONTD) 
i |27| f: This algorithm is designed for 3-order tensorial 
dafa. If proposes a tensor decomposition with nonneg¬ 
ative factor matrices. Since it directly deals with tensor 
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data, it can be considered as a benchmark algorithm for 
comparison with our proposed method. 

Benchmark PCA-B: To demonstrate the benefits of fhe 
proposed model and, parficularly, fhe Riemannian al- 
gorifhm, we fake fhe following revised algorifhm as 
anofher benchmark algorifhm. Insfead of fhe problem 
we propose fo consider fhe problem ||^ for a non- 
negafive membership mafrix Ujv only, i.e., fhe consfrainf 
on fhe sum of each row fo 1 is liffed. This makes fhe 
problem formulafion simpler, which can be solved by 
alfemafively opfimizing of each facfor mafrices, similar 
fo fhe procedure in Fig. In fhis case, opfimizing U n 
becomes a quadratic programming problem with nonnegative 
constraints, which can be solved, e.g., using fhe Maflab 
funcfion Isqlin.m. 

PCA-Multinomial: The algorifhm proposed in Fig. 
which is based on fhe proposed heferogeneous Tucker 
decomposition model discussed in Secfion 


2.2 


If should be nofed fhaf mosf above described al- 
gorifhms perform better fhan (or comparable in some 
cases) fo classical clusfering algorifhms, like fhe k-means 
p7| , | [5^ . Hence, we do nof compare wifh fhe classical 
clusfering algorifhms for vecforized dafa. 

The implemenfafions of fhe nonnegafive mafrix fac- 
forizafion algorifhms MM, ATS, B-NFM, and PGM are 
in fhe Maflab NMF foolbox provided by Kasper Winfher 
Joergensen. The Maflab code for DRCC is provided by 
fhe aufhors of | |56) . We implemenf all fhe ofher algo¬ 
rifhms. Our proposed algorifhm, PCA-Mulfinomial, is 
implemenfed in Maflab using fhe Manopf foolbox 


Experiment I 

We firsf fesf all fhe algorifhms on fhe CBCL face dafasef 
as fhis is fhe simplesf case where fhere are only two 
clusfers of images fo be learned. The dafasizes are chosen 
from 200 fo 1600 by 100 for each class and fhe experimenf 
resulfs indicafe fhaf when size = 600 almosf all fhe 
algorifhms achieved similar accuracy. Under fhis size, we 
also fesf differenf fensor core sizes varying from 4 fo 10 
and find better resulfs for our proposed fensor clusfering 
mefhod in fhe case of fensor core size 8. 

Table shows fhe means and variances (in brackefs) 
for all fhe algorifhms after 10 runs. In Tablewe see fhaf 
all fhe algorifhms are comparable for fhe case when fhe 
number of clusfers is 2. For fhis experimenf, DRCC has 
shown beffer performance fhan all fhe ofher algorifhms. 
However, we also observe fhaf in experimenfs when K = 
2 fhe performance of DRCC degrades. The benchmark 
algorifhm PCA-B is slighfly beffer fhan fhe proposed 
PCA-Mulfinomial by 1.8% in AC. However, fhe variance 
of PCA-Mulfinomial is beffer in fhis experimenf showing 
robusfness of fhe Riemannian opfimizafion algorifhm. 


Experiment it 

In fhis experimenf, we assess fhe impacf of fhe number 
of classes on fhe performance of all fhe algorifhms. 
For fhis purpose, we conducf all fhe algorifhms on 
fhe MNIST handwriffen digifs dafasef. There are fen 



Fig. 4. The first row shows the Reconstructed Digits 2, 
3, 7 and 9 from the iearned centroids. The second row 
shows the reconstructed digits with positive vaiues. 


different classes, corresponding to the digits 0 to 9. Most 
machine learning algorithms work well for the case of 
two clusters, i.e., K — 2. For K > 2, the performance of 
different algorithms varies, observing which is the main 
focus of the present experiment. 

We randomly pick digits for training data according 
to the class number K = 3,4, with 100 images 
for each class. Once the classes are decided randomly, 
we run all the algorithms 5 times each with randomly 
chosen digits in the classes. The experimental results are 
reported in Table in terms of AC and NMI evaluation 
metrics. It should be noted that Tri-ONTD algorithm fails 
for most of cases, and therefore, no results are reported. 
The results in Table |4 show that the proposed PCA- 
Mulitnomial performs better in the cases of more than 
four classes. This is consistent with the results from the 
first experiment, where we observe a similar behavior. 
For K = 3,4, the performance of PCA-Multinomial is 
comparable with that of MM and PGM in both AC and 
NMI metrics. 

Figure 1^ shows the reconstructed representatives from 
the learned centroids for K = 4, given by X* = Q* Xi 
Ui X 2 • • • X AT-i U N-i with the learned low dimensional 
centroids G*- 

Experiment ill 

A test similar to Experiment II is conducted on the PIE 
dataset to further confirm the main conclusion form 
Experiment II, that the performance of PCA-Multinomial 
is better when are a larger number of clusters are sought. 
We do not report the performance of the Tri-ONTD 
algorithm as it does not provide meaningful results in 
our experiments. The PIE dataset contains 68 clusters, 
each with 42 objects. In this experiment, we test the 
algorithms on the data of 68, 58, 48, 38, 28, 18, and 8 
clusters, respectively. To maintain the data size to be 
around 600, we randomly pick 9 images for the case of 
cluster 68, 11 for 58, 13 for 48, 16 for 38, 21 for 28, 33 for 
18, and 42 for 8. The procedure is repeated for 5 runs. 
In the last case, the data size is 336 and each run is with 
8 different clusters. 
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TABLE 3 

Results for the CBCL face dataset with 1200 randomly chosen data (600 from each class). 


Evaluation metric 

MM 

ALS 

B-NFM 

PGM 

DRCC 

Tri-ONTD 

PCA-B 

PCA-Mulitnomial 

AC 

0.6755 

0.6424 

0.6305 

0.5976 

0.7352 

0.5764 

0.6394 

0.6208 


(5.494e-4) 

(3.255e-4) 

(3.594e-4) 

(2.073e-3) 

(1.367e-3) 

(3.503e-5) 

(3.301e-3) 

(1.018e-4) 

NMI 

0.0928 

0.0604 

0.0509 

0.0334 

0.1766 

0.0824 

0.0669 

0.0430 


(6.351e-4) 

(2.505e-4) 

(2.296e-4) 

(7.794e-4) 

(2.777e-3) 

(1.836e-4) 

(1.181e-3) 

(5.524e-5) 


TABLE 4 

Results for the MNIST handwritten digits with 1000 randomly chosen images (100 from each class). 



Evaluation metric 

MM 

ALS 

B-NEM 

PGM 

DRCC 

PCA-B 

PCA-Muitinomiai 

K = 3 

AC 

0.9173 

0.9133 

0.8820 

0.9160 

0.9013 

0.8506 

0.9114 



(2.800e-4) 

(2.6111e-4) 

(1.644e-4) 

(1.800e-4) 

(3.589e-4) 

(5.274e-3) 

{1.700e-4) 


NMI 

0.7351 

0.7344 

0.6806 

0.7307 

0.7115 

0.6231 

0.7252 



(2.488e-3) 

(1.654e-3) 

(6.069e-4) 

(1.792e-3) 

(1.704e-3) 

(1.492e-2) 

(9.761e-4) 

K = A 

AC 

0.7140 

0.6520 

0.6780 

0.7195 

0.6575 

0.6670 

0.6800 



(8.518e-4) 

(8.611e-3) 

(4.138e-4) 

(5.325e-4) 

(3.237e-3) 

(4.576e-3) 

(4.906e-4) 


NMI 

0.4654 

0.4181 

0.4446 

0.4680 

0.4071 

0.4184 

0.4360 



(1.726e-3) 

(3.279e-3) 

(5.964e-4) 

(1.181e-3) 

(3.278e-3) 

(2.588e-3) 

(7.243e-4) 

K = 5 

AC 

0.6216 

0.7112 

0.6104 

0.6764 

0.6632 

0.6960 

0.7196 



(7.731e-3) 

(3.292e-4) 

(3.839e-3) 

(1.533e-3) 

(2.792e-4) 

(1.370e-3) 

(1.334e-3) 


NMI 

0.5201 

0.5590 

0.4992 

0.5373 

0.5198 

0.5242 

0.5547 



(2.080e-3) 

(1.018e-3) 

(2.921e-3) 

(9.196e-4) 

(2.156e-4) 

(1.130e-3) 

(5.525e-4) 

K = 6 

AC 

0.5370 

0.5337 

0.5156 

0.5466 

0.4983 

0.5283 

0.5540 



(4.249e-3) 

(3.626e-3) 

(3.812e-3) 

(4.257e-3) 

(1.061e-3) 

(5.234e-3) 

(6.386e-3) 


NMI 

0.4007 

0.3852 

0.3980 

0.3962 

0.3670 

0.3850 

0.4083 



(9.891e-4) 

(6.482e-4) 

(1.197e-3) 

(1.507e-3) 

(1.348e-3) 

(3.077e-3) 

(1.945e-3) 

K = 7 

AC 

0.5988 

0.5974 

0.5762 

0.5940 

0.6585 

0.6074 

0.6231 



(8.816e-5) 

(3.295e-3) 

(1.051e-3) 

(3.704e-3) 

(1.789e-3) 

(3.704e-3) 

{5.618e-3) 


NMI 

0.5382 

0.5378 

0.4972 

0.5342 

0.5629 

0.5097 

0.5401 



(7.593e-4) 

(1.607e-3) 

(7.238e-4) 

(9.776e-4) 

(9.366e-4) 

(1.204e-3) 

(3.828e-4) 

K = 8 

AC 

0.6455 

0.6162 

0.5930 

0.5942 

0.6270 

0.5923 

0.6485 



(3.217e-3) 

(8.523e-4) 

(3.880e-3) 

(3.267e-3) 

(2.312e-3) 

(4.726e-3) 

(4.496e-3) 


NMI 

0.5427 

0.5496 

0.4979 

0.5152 

0.5033 

0.4944 

0.5254 



(5.694e-4) 

(6.968e-4) 

(1.329e-3) 

(2.243e-3) 

(3.711e-3) 

(2.286e-3) 

(2.367e-3) 

K = 9 

AC 

0.5211 

0.5567 

0.5137 

0.5302 

0.5240 

0.5224 

0.5756 



(1.251e-3) 

(8.173e-4) 

(5.528e-4) 

(1.816e-3) 

(7.867e-4) 

(8.213e-4) 

(3.569e-3) 


NMI 

0.4621 

0.4716 

0.4343 

0.4720 

0.4493 

0.4449 

0.4774 



(3.209e-4) 

(3.473e-4) 

(3.890e-4) 

(5.279e-4) 

(8.81le-4) 

(1.130e-3) 

(5.103e-4) 

K = 10 

AC 

0.4980 

0.4800 

0.4474 

0.4958 

0.4668 

0.4800 

0.5126 



(1.154e-3) 

(9.220e-4) 

(5.823e-4) 

(1.833e-3) 

(2.846e-3) 

(2.761e-3) 

(1.785e-3) 


NMI 

0.4470 

0.4505 

0.4089 

0.4516 

0.4053 

0.4142 

0.4461 



(7.543e-4) 

(5.267e-4) 

(3.474e-4) 

(2.051e-3) 

(3.402e-3) 

(8.642e-4) 

(1.146e-3) 


The results are collected in Table H] PCA-Multinomial 
performs better than the others, except for fhe case 
K = 8, where MM fakes fhe lead. Parf of fhis behavior is 
due fo lack of sufficient data in fhe case of clusfer K = 8. 
However NMI scores are all over 80%, excepf for fhe case 
K = 8. Bofh DRCC and PCA-B do nof show meaningful 
resulfs on fhis dafasef. 

Experiment IV 

In fhis experimenf, we fesf fhe algorifhms on bofh fhe 
ORL dafasef of 40 subjecfs wifh 400 images and fhe YaleB 
exfended dafasef of 38 subjecfs wifh 2414 images. 

For fhe ORL dafasef, we randomly choose 10 subjecfs 
for five runs of fhe algorifhm. The resulfs in Table|^show 
fhaf fhe PCA-Mulfinomial algorifhm oufperforms all fhe 
ofher algorifhms on fhe ORL dafasef in ferms of bofh AC 
and NMI scores. 

For fhe YaleB exfended dafasef, we use all fhe images. 
Hence, only one fesf is conducfed and fhe resulfs are 
summarized in Table If should be nofed fhaf all fhe 


algorifhms show poor performance on fhe YaleB dafasef 
alfhough fhe PCA-Mulfinomial algorifhm has relafively 
beffer performance. This is due fo fhe large varianfs 
among fhis dafasef. 

Experiment V 

In fhis experimenf, we fesf fhe PCA-Mulfinomial al¬ 
gorifhm againsf fhe benchmark PCA-B algorifhm on 
fhe D 5 mTex++ dafasef. The ofher algorifhms are nof 
appropriafe for fhis dafasef. Tri-ONTD works for mafrix 
dafa, i.e., if can handle a sef of images or 3-order fensors, 
whereas fhe resf work wifh vecforial dafa, i.e., can 
handle only mafrices. The core fensor size is fixed fo 
(30,30,30). We conducf fesfs on fhe cases of up fo 12 
classes. In each fesf, fhe fofal number of fraining video 
subsequences is defermined fo 200 for 2 classes, 270 for 
3 classes, 360 for 4 classes, 480 for 6 classes, 600 for 
8 classes and 10 classes, and 660 for 12 classes. The 
classes are randomly chosen. For each case, five runs are 
conducfed by randomly choosing video subsequences in 
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TABLE 5 

Results on PIE face dataset with about 600 randomly chosen data. 



Evaluation metric 

MM 

ALS 

B-NFM 

PGM 

DRCC 

PCA-B 

PCA-Multinomiai 

00 

CD 

II 

AC 

0.6088 

0.5633 

0.6192 

0.5120 

0.2745 

0.4314 

0.6533 



(4.563e-4) 

(2.101e-4) 

(4.311e-4) 

(7.590e-4) 

(2.736e-4) 

(1.018e-3) 

(8.749e-4) 


NMI 

0.8066 

0.7791 

0.8099 

0.7540 

0.5652 

0.6828 

0.8634 



(2.222e-4) 

(4.083e-5) 

(9.111e-5) 

(9.963e-5) 

(3.231e-4) 

(3.197e-4) 

(2.484e-4) 

II 

Ol 

00 

AC 

0.6182 

0.5730 

0.5968 

0.5482 

0.2504 

0.4125 

0.6781 



(5.719e-4) 

(3.549e-3) 

(1.884e-4) 

(1.362e-3) 

(2.215e-4) 

(1.318e-3) 

(9.944e-4) 


NMI 

0.8027 

0.7790 

0.7955 

0.7580 

0.5140 

0.6555 

0.8742 



(2.504e-4) 

(6.398e-4) 

(2.789e-5) 

(2.096e-4) 

(2.995e-4) 

(8.198e-4) 

(1.483e-4) 

00 

II 

AC 

0.6112 

0.5964 

0.6147 

0.5951 

0.2875 

0.4228 

0.6990 



(1.625e-4) 

(6.027e-4) 

(1.777e-3) 

(5.105e-4) 

(4.925e-4) 

(6.271e-4) 

(7.863e-4) 


NMI 

0.7899 

0.7812 

0.7896 

0.7775 

0.5266 

0.6491 

0.8480 



(1.084e-4) 

(4.319e-4) 

(1.928e-4) 

(1.633e-4) 

(2.659e-4) 

(1.115e-4) 

(8.585e-5) 

00 

CO 

II 

AC 

0.5799 

0.6197 

0.6138 

0.5930 

0.2621 

0.3911 

0.7118 



(2.292e-3) 

(9.405e-4) 

(9.892e-4) 

(3.646e-4) 

(3.578e-4) 

(1.227e-3) 

(1.384e-3) 


NMI 

0.7552 

0.7769 

0.7717 

0.7553 

0.4461 

0.5949 

0.8516 



(3.956e-4) 

(6.884e-4) 

(5.941e-4) 

(1.132e-4) 

(1.461e-4) 

(5.438e-4) 

(1.716e-4) 

00 

C<l 

II 

AC 

0.6224 

0.6214 

0.6329 

0.6408 

0.2619 

0.4212 

0.6877 



(2.564e-3) 

(1.448e-3) 

(4.329e-4) 

(8.523e-4) 

(2.704e-4) 

(4.090e-3) 

(4.106e-3) 


NMI 

0.7470 

0.7610 

0.7586 

0.7519 

0.4033 

0.5669 

0.8153 



(4.750e-4) 

(5.707e-4) 

(1.975e-4) 

(3.880e-4) 

(3.592e-4) 

(2.179e-3) 

(5.873e-4) 

to = 18 

AC 

0.6367 

0.6387 

0.5892 

0.6511 

0.2962 

0.4865 

0.7329 



(6.368e-4) 

(1.002e-2) 

(5.057e-3) 

(2.232e-3) 

(2.326e-3) 

(6.932e-3) 

(1.352e-3) 


NMI 

0.7290 

0.7433 

0.7089 

0.7377 

0.3836 

0.5917 

0.8010 



(5.039e-4) 

(2.437e-3) 

(8.551e-4) 

(4.477e-4) 

(1.570e-3) 

(3.972e-3) 

(7.787e-4) 

to = 8 

AC 

0.6690 

0.6256 

0.6321 

0.6285 

0.3809 

0.4785 

0.6523 



(1.073e-3) 

(2.698e-3) 

(9.592e-4) 

(1.396e-3) 

(9.553e-3) 

(3.036e-2) 

(1.634e-2) 


NMI 

0.6512 

0.6797 

0.6382 

0.6289 

0.3110 

0.4403 

0.6803 



{4.287e-3) 

(5.646e-4) 

(6.623e-4) 

(1.509e-3) 

(1.269e-2) 

(3.941e-2) 

(1.057e-2) 


TABLE 6 

Results for both ORL and YaleB face datasets. 



Evaluation metric 

MM 

ALS 

B-NFM 

PGM 

DRCC 

PCA-B 

PCA-Multinomial 

ORL 

AC 

0.6440 

0.6820 

0.6280 

0.6300 

0.5580 

0.6360 

0.7340 



(3.730e-3) 

(9.070e-3) 

(8.070e-3) 

(3.650e-3) 

(4.870e-3) 

(5.130e-3) 

(2.230e-3) 


NMI 

0.7308 

0.7269 

0.7107 

0.7236 

0.6498 

0.6968 

0.7996 



(1.842e-3) 

(5.946e-3) 

(2.688e-3) 

(1.668e-3) 

(8.727e-4) 

(5.467e-3) 

(7.283e-4) 

YaleB 

AC 

0.2175 

0.2270 

0.2104 

0.2084 

0.1002 

0.2117 

0.3293 


NMI 

0.3534 

0.3470 

0.3508 

0.3477 

0.1455 

0.3402 

0.4772 


the chosen classes, except for the case of fwo classes, 
where fhe only 200 subsequences are all used in a single 
run. 

The resulfs are summarized in Table Parficularly, 
fhe resulfs show fhe challenging nafure of clusfering 
d 5 rnamic fexfures for K > 6 classes. In all fhe cases, 
PCA-Mulfinomial algorifhm consisfenfly performs beffer 
fhan fhe benchmark PCA-B. This once again confirms 
fhe benefifs of using fhe RTR algorifhm based on fhe 
Riemannian geomefry of fhe mulfinomial manifold. 

6 Conclusion 

We proposed a heferogeneous Tucker decomposifion 
model for fensor clusfering. The model simulfaneously 
conducfs dimensionalify reducfion and optimizes mem¬ 
bership represenfafion. The dimensionalify reducfion is 
conducfed along fhe firsf {N — l)-modes while optimiz¬ 
ing membership on fhe lasf mode of fensorial dafa. The 
resulting optimization problem is a nonlinear opfimiza- 
fion problem over fhe special mafrix mulfinomial man¬ 
ifold. To apply fhe Riemannian frusf region algorifhm 


fo fhe problem, we explored fhe Riemannian geomefry 
of fhe manifold under fhe Fisher informafion mefric. 
Finally, we implemenfed fhe clusfering algorifhms based 
on fhe Riemannian opfimizafion. We used a number 
of real-world dafasefs fo fesf fhe proposed algorifhms 
and compared fhem wifh exisfing sfafe-of-fhe-arf non- 
negafive mafrix facforizafion mefhods. Numerical resulfs 
illusfrafe fhe good clusfering performance of fhe pro¬ 
posed algorifhms parficularly in fhe case of higher class 
numbers. 

Furfher work can be carried ouf in several differenf 
directions. For example, in fhe currenf work we use 
an alfemafing procedure fo opfimize over all fhe factor 
mafrices Ui, U 2 , ..., U^v. However, fhis is nof fhe only 
way fo solve fhe problem. As fhe consfrainf U^U„ = I 
for n = 1,2,...,A — 1 defines fhe Sfiefel manifold, 
fhe overall opfimizafion problem is defined over 
fhe producf manifold of {N — 1) Sfiefel manifolds and 
one pn fhe mulfinomial manifold. This can be direcfly 
opfimized over fhe entire producf of N manifolds, e.g., 
using Manopf. Anofher issue fhaf needs fo be furfher 
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TABLE 7 

Results for the DynTex++ dataset. 


Models 

Evaluation metric 

2 Classes 

3 Classes 

4 Classes 

6 Classes 

8 Classes 

10 Classes 

12 Classes 

PCA-B 

AC 

0.5550 

0.4230 

0.3727 

0.2983 

0.2507 

0.1937 

0.2024 




( 4.069e-3) 

( 1.429e-3) 

( 4.088e-4) 

( 5.244e-4) 

( 8.527e-5) 

(3.416e-4) 


NMI 

0.1034 

0.1849 

0.2261 

0.2079 

0.2259 

0.2451 

0.2520 



- 

( 8.471e-3) 

( 1.667e-3) 

( 4.817e-4) 

( 2.926e-4) 

( 1.057e-3) 

(9.549e-5) 

PCA-Multinomial 

AC 

0.7901 

0.5827 

0.5767 

0.3842 

0.4007 

0.3300 

0.2980 




( 5.522e-4) 

( 6.350e-3) 

( 1.361e-3) 

( 1.325e-3) 

( 1.272e-3) 

( 2.519e-4) 


NMI 

0.4056 

0.2047 

0.3360 

0.3012 

0.3538 

0.3285 

0.3444 



- 

( 5.924e-4) 

( 6.116e-4) 

( 9.165e-4) 

( 5.356e-4) 

( 4.801e-4) 

( 1.563e-4) 


investigated is how to scale the algorithm for ||^ to high¬ 
dimensional data. 
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